Recently it has been shown that state-of-the-art NLP models are vulnerable to adversarial attacks, where the predictions of a model can be drastically altered by slight modifications to the input (such as synonym substitutions). While several defense techniques have been proposed and adapted to the discrete nature of text adversarial attacks, the benefits of general-purpose regularization methods such as label smoothing for language models have not been studied. In this paper, we study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks in both in-domain and out-of-domain settings. Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT against various popular attacks. We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
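As a concrete reference point for the regularizer being studied, here is a minimal PyTorch sketch of standard uniform label smoothing; the smoothing weight of 0.1 is illustrative, and the paper's specific smoothing strategies may differ.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, epsilon=0.1):
    """Cross-entropy against uniformly smoothed targets.

    The gold class keeps probability 1 - epsilon (plus its share of the
    uniform mass), and the remaining epsilon is spread evenly over all
    classes, which caps how confident the model can be on any example.
    """
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Smoothed target distribution: eps/K everywhere, extra mass on the gold class.
    smooth_targets = torch.full_like(log_probs, epsilon / num_classes)
    smooth_targets.scatter_(-1, targets.unsqueeze(-1),
                            1.0 - epsilon + epsilon / num_classes)
    return -(smooth_targets * log_probs).sum(dim=-1).mean()
```

Recent PyTorch releases also expose the same behavior directly via `torch.nn.CrossEntropyLoss(label_smoothing=0.1)`.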
We view human-robot collaborative problem solving as a planning task coupled with natural language communication. Our framework consists of three components: a natural language engine that parses language utterances into formal representations and vice versa, a concept learner that induces generalized concepts of plans from limited interactions with the user, and an HTN planner that solves the planning task arising from the human interaction. We illustrate the framework's ability to address the key challenges of collaborative problem solving by demonstrating it on a collaborative building task in a Minecraft-based Blocksworld domain. An accompanying demonstration video is available at https://youtu.be/q1pwe4aahf0.
A fundamental quest in deep learning theory is to understand the properties of the trajectories in weight space that learning algorithms take. One such property isolated recently is "local elasticity" ($S_{\rm rel}$), which quantifies how the influence of sampling one data point propagates to the prediction at another data point. In this work, we perform a comprehensive study of local elasticity, providing new theoretical insights and more careful empirical evidence in various settings. First, specific to the classification setting, we propose a new definition of the original notion of $S_{\rm rel}$. Via experiments on state-of-the-art neural networks trained on SVHN, CIFAR-10, and CIFAR-100, we demonstrate that our new $S_{\rm rel}$ detects the property that weight updates tend to change predictions on data of the same class as the sampled point. Next, we demonstrate through examples of neural networks that the original $S_{\rm rel}$ exhibits a two-phase behavior: training proceeds through an initial elastic phase, when $S_{\rm rel}$ changes rapidly, and a final inelastic phase, when $S_{\rm rel}$ remains large. Finally, we provide multiple examples of learning via gradient flow for which one can obtain closed-form expressions of the original $S_{\rm rel}$ function. By studying the plots of these derived formulas, we give a theoretical demonstration of some of the experimentally detected properties in the regression setting.
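The abstract does not spell out the measure, but one common formalization of relative local elasticity in this literature (stated here as an assumption, not the paper's exact definition) is the ratio of prediction changes induced by a single update:

$$
S_{\rm rel}(x, x') \;=\; \frac{\lVert f(x';\, w^{+}) - f(x';\, w) \rVert}{\lVert f(x;\, w^{+}) - f(x;\, w) \rVert},
$$

where $w^{+}$ denotes the weights after a gradient update computed on the sampled point $x$ and $f(\cdot\,; w)$ is the network's prediction; large values indicate that the update on $x$ propagates strongly to $x'$.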
Reading comprehension of legal text can be a particularly challenging task due to the length and complexity of legal clauses and a shortage of expert-annotated datasets. To address this challenge, we introduce the Merger Agreement Understanding Dataset (MAUD), an expert-annotated reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points Study, with over 39,000 examples and over 47,000 total annotations. Our fine-tuned Transformer baselines show promising results, with models performing well above random on most questions. However, on a large subset of questions, there is still room for significant improvement. As the only expert-annotated merger agreement dataset, MAUD is valuable as a benchmark for both the legal profession and the NLP community.
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
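The 2:4 semi-structured pattern mentioned above can be illustrated with a short magnitude-based sketch. Note that SparseGPT itself selects and updates weights via a layer-wise second-order reconstruction rather than raw magnitudes, so this only demonstrates the sparsity pattern, not the method.

```python
import torch

def prune_2_4_by_magnitude(weight: torch.Tensor) -> torch.Tensor:
    """Zero out the 2 smallest-magnitude weights in every group of 4.

    Expects a 2-D weight matrix (out_features x in_features) and groups
    the input dimension in blocks of 4, as in the 2:4 sparsity pattern.
    """
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity groups the input dimension in blocks of 4"
    groups = weight.view(out_features, in_features // 4, 4)
    # Keep the 2 largest |w| in each group of 4, zero the rest.
    idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, idx, True)
    return (groups * mask).view(out_features, in_features)
```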
Despite the success of large language models (LLMs) in various natural language processing (NLP) tasks, the stored knowledge in these models may inevitably be incomplete, out-of-date, or incorrect. This motivates the need to utilize external knowledge to assist LLMs. Unfortunately, current methods for incorporating external knowledge often require additional training or fine-tuning, which can be costly and may not be feasible for LLMs. To address this issue, we propose a novel post-processing approach, rethinking with retrieval (RR), which retrieves relevant external knowledge based on the decomposed reasoning steps obtained from the chain-of-thought (CoT) prompting. This lightweight approach does not require additional training or fine-tuning and is not limited by the input length of LLMs. We evaluate the effectiveness of RR through extensive experiments with GPT-3 on three complex reasoning tasks: commonsense reasoning, temporal reasoning, and tabular reasoning. Our results show that RR can produce more faithful explanations and improve the performance of LLMs.
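Since RR is described as a purely post-hoc procedure, its control flow can be sketched with hypothetical `llm` and `retriever` callables; the abstract does not specify interfaces or prompts, so this only illustrates the shape of the pipeline.

```python
def rethinking_with_retrieval(question, llm, retriever):
    """Schematic sketch of the RR post-processing loop described above.

    `llm` maps a prompt string to generated text and `retriever` maps a
    query string to retrieved evidence; both are placeholders. No extra
    training is involved, only retrieval conditioned on each reasoning step.
    """
    # 1. Obtain a chain-of-thought answer and split it into reasoning steps.
    cot_output = llm(f"{question}\nLet's think step by step.")
    steps = [s.strip() for s in cot_output.split("\n") if s.strip()]

    # 2. Retrieve external evidence for every decomposed step.
    evidence = [retriever(step) for step in steps]

    # 3. Ask the model to revise its answer in light of the retrieved facts.
    context = "\n".join(f"Step: {s}\nEvidence: {e}" for s, e in zip(steps, evidence))
    return llm(f"{question}\n{context}\nGiven the evidence, what is the final answer?")
```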
Model quantization enables the deployment of deep neural networks under resource-constrained devices. Vector quantization aims at reducing the model size by indexing model weights with full-precision embeddings, i.e., codewords, while the index needs to be restored to 32-bit during computation. Binary and other low-precision quantization methods can reduce the model size up to 32$\times$, however, at the cost of a considerable accuracy drop. In this paper, we propose an efficient framework for ternary quantization to produce smaller and more accurate compressed models. By integrating hyperspherical learning, pruning and reinitialization, our proposed Hyperspherical Quantization (HQ) method reduces the cosine distance between the full-precision and ternary weights, thus reducing the bias of the straight-through gradient estimator during ternary quantization. Compared with existing work at similar compression levels ($\sim$30$\times$, $\sim$40$\times$), our method significantly improves the test accuracy and reduces the model size.
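For orientation, a generic ternary quantizer with a straight-through estimator looks roughly as follows; the HQ method described above additionally keeps weights on the hypersphere so that the cosine distance, and hence the estimator's bias, stays small. The threshold value in the usage line is illustrative.

```python
import torch

class TernaryQuant(torch.autograd.Function):
    """Ternarize weights to {-s, 0, +s} with a straight-through gradient."""

    @staticmethod
    def forward(ctx, w, threshold):
        mask = w.abs() > threshold
        # Scale s is the mean magnitude of the surviving weights.
        scale = w[mask].abs().mean() if mask.any() else w.new_tensor(1.0)
        return torch.sign(w) * mask * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient through unchanged.
        # The closer the full-precision and ternary weights are, the smaller
        # the bias this shortcut introduces.
        return grad_output, None

# Usage (threshold is illustrative): ternary_w = TernaryQuant.apply(weight, 0.05)
```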
Most existing pruning works are resource-intensive, requiring retraining or fine-tuning of the pruned models to recover accuracy. We propose a retraining-free pruning method based on hyperspherical learning and loss penalty terms. The proposed loss penalty term pushes some of the model weights far from zero, while the remaining weights are pushed near zero and can be safely pruned with no need for retraining and a negligible accuracy drop. In addition, our proposed method can instantly recover the accuracy of a pruned model by replacing the pruned values with their mean value. Our method obtains state-of-the-art results in retraining-free pruning and is evaluated on ResNet-18/50 and MobileNetV2 with the ImageNet dataset. One can easily obtain a 50\% pruned ResNet-18 model with a 0.47\% accuracy drop. With fine-tuning, the experimental results show that our method can significantly boost the accuracy of pruned models compared with existing works. For example, the accuracy of a 70\% pruned (except the first convolutional layer) MobileNetV2 model drops only 3.5\%, much less than the 7\%$\sim$10\% accuracy drop of conventional methods.
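A simplified sketch of the mean-replacement recovery step follows; it assumes magnitude-based selection of the pruned weights, whereas the actual method relies on the hyperspherical loss penalty to separate large and near-zero weights before pruning.

```python
import torch

def prune_with_mean_recovery(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Prune small-magnitude weights and replace them with their mean value.

    Keeping the mean of the removed values in the pruned positions is the
    "instant recovery" idea described in the abstract, sketched here under
    a plain magnitude criterion.
    """
    flat = weight.abs().flatten()
    k = int(sparsity * flat.numel())
    threshold = flat.kthvalue(k).values if k > 0 else flat.new_tensor(0.0)
    prune_mask = weight.abs() <= threshold
    pruned = weight.clone()
    # Pruned positions take the mean of the removed values instead of zero.
    pruned[prune_mask] = weight[prune_mask].mean() if prune_mask.any() else 0.0
    return pruned
```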
Most existing works use projection functions for ternary quantization in discrete space. Scaling factors and thresholds are used in some cases to improve model accuracy. However, the gradients used for optimization are inaccurate and result in a notable accuracy gap between the full-precision and ternary models. To obtain more accurate gradients, some works gradually increase the discrete portion of the full-precision weights in the forward propagation pass, e.g., using a temperature-based Sigmoid function. Instead of directly performing ternary quantization in discrete space, we push full-precision weights close to ternary ones through a regularization term prior to ternary quantization. In addition, inspired by the temperature-based method, we introduce a re-scaling factor that yields more accurate gradients by simulating the derivative of the Sigmoid function. Experimental results show that our method significantly improves the accuracy of ternary quantization in both image classification and object detection tasks.
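A hedged sketch of the two ingredients described above: a regularization term that pulls full-precision weights toward ternary values, and a surrogate backward pass that re-scales the Sigmoid derivative. The threshold and scaling factor `alpha` are illustrative placeholders, not the paper's settings.

```python
import torch

def ternary_regularizer(w: torch.Tensor, scale: float, threshold: float) -> torch.Tensor:
    """Penalty pulling full-precision weights toward the ternary set {-scale, 0, +scale}."""
    target = torch.sign(w) * (w.abs() > threshold) * scale
    return ((w - target) ** 2).sum()

class RescaledSigmoidSTE(torch.autograd.Function):
    """Ternary projection whose backward pass mimics a re-scaled Sigmoid derivative."""

    @staticmethod
    def forward(ctx, w, threshold, alpha):
        ctx.save_for_backward(w)
        ctx.alpha = alpha
        return torch.sign(w) * (w.abs() > threshold)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Surrogate gradient: derivative of sigmoid(alpha * w), re-scaled by alpha.
        s = torch.sigmoid(ctx.alpha * w)
        return grad_output * ctx.alpha * s * (1.0 - s), None, None

# Usage (values illustrative): q_w = RescaledSigmoidSTE.apply(weight, 0.05, 4.0)
```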
Question: Can an encoder-decoder architecture pretrained on a large dataset of longitudinal electronic health records improve patient outcome predictions? Findings: In this prognostic study of 6.8 million patients, our denoising sequence-to-sequence model for predicting multiple outcomes outperformed state-of-the-art models such as pretrained BERT on a broad range of patient outcomes, including intentional self-harm and pancreatic cancer. Meaning: Deep bidirectional and autoregressive representation improves patient outcome prediction.